{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "521a8e94",
   "metadata": {},
   "source": [
    "### Gradio and HuggingFace\n",
    "\n",
    "In this demo, we show how to build ready to deploy or use deep learning models. \n",
    "\n",
    "Hugging Face hosts thousands of pre-trained models in [Model Hub](https://huggingface.co/models). They also built high-level APIs so we can easily use and deploy pre-trained models using [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines). \n",
    "\n",
    "`gradio` provides APIs so we can easily build web applications that use our pre-trained models from Hugging Face. `gradio` also provides APIs so we can easily incorporate input and output web UIs.\n",
    "\n",
    "After building the `gradio`, we can have permanent hosting using [Hugging Face Spaces](https://huggingface.co/blog/gradio-spaces). \n",
    "\n",
    "Let us first install Hugging Face `transformers` and `gradio`.\n",
    "\n",
    "**Note:** For some examples, it is best to launch the app in another tab to enable access to the required inputs such as microphone or webcam. Running the app may also lock the `python` kernel and the notebook becomes unresponsive. In that case, please restart the kernel and clear the output. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54da9029",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install transformers --upgrade\n",
    "!pip install gradio --upgrade\n",
    "!pip install torch torchvision torchaudio --upgrade\n",
    "!pip install torchao --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a21369ff",
   "metadata": {},
   "source": [
    "#### Hello world in gradio\n",
    "\n",
    "As a tradition, let us build the simplest `gradio` app. It accepts a `text` input and calls the `greet()` function to process this input and convert into another text. The output of `greet()` becomes the output of the `gradio` app.\n",
    "\n",
    "To see our application, we call `launch()` after constructing our `gradio` `Interface`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c2c37a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "\n",
    "def greet(name):\n",
    "    return \"Hello \" + name + \"!!\"\n",
    "\n",
    "gr.Interface(fn=greet, inputs=\"text\", outputs=\"text\").launch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74361507",
   "metadata": {},
   "source": [
    "#### Object Recognition using ResNet18\n",
    "\n",
    "In our discussion about PyTorch, we used a pre-trained ResNet18 model from `torchvision`. We use `jupyter` notebook to show the results. The `jupyter` notebook is not an application that we can deploy and other people use with ease. The same with Google's colab. \n",
    "\n",
    "In this example, we use `gradio` to build a simple app that an end user can easily interact with. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d73061fd",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "import torch \n",
    "import torchvision\n",
    "import torchvision.transforms as transforms\n",
    "import requests\n",
    "from torchvision.models import resnet18, ResNet18_Weights\n",
    "\n",
    "\n",
    "normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n",
    "                                 std=[0.229, 0.224, 0.225])\n",
    "\n",
    "resnet = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)\n",
    "resnet.eval()\n",
    "\n",
    "# Download human-readable labels for ImageNet.\n",
    "response = requests.get(\"https://git.io/JJkYN\")\n",
    "labels = response.text.split(\"\\n\")\n",
    "\n",
    "def classify(img):\n",
    "    # By default, gradio image is numpy\n",
    "    img = torch.from_numpy(img)\n",
    "    # Numpy image is channel last. PyTorch is channel 1st.\n",
    "    img = img.permute(2, 0, 1)\n",
    "    \n",
    "    # The transforms before prediction\n",
    "    img = torchvision.transforms.Resize(256, antialias=True)(img)\n",
    "    img = torchvision.transforms.CenterCrop(224)(img).float()/255.\n",
    "    img = normalize(img)\n",
    "    \n",
    "    # We insert batch size of 1\n",
    "    img = img.unsqueeze(0)\n",
    "    \n",
    "    # The actual prediction\n",
    "    with torch.no_grad():\n",
    "        pred = resnet(img)\n",
    "    \n",
    "    # Convert the prediction to probabilities\n",
    "    pred = torch.nn.functional.softmax(pred, dim=1)\n",
    "    # Remove the batch dim. torch.squeeze() can also be used.\n",
    "    pred = pred.squeeze()\n",
    "    \n",
    "    # torch to numpy space\n",
    "    pred = pred.cpu().numpy()\n",
    "    \n",
    "    return {labels[i]: float(pred[i]) for i in range(1000)}\n",
    "    \n",
    "\n",
    "gr.Interface(fn=classify, \n",
    "             inputs=\"image\",\n",
    "             outputs=gr.Label(num_top_classes=5),\n",
    "             title=\"1k Object Recognition\",\n",
    "             examples=['assets/wonder_cat.jpg', 'assets/aki_dog.jpg','assets/birdie1.jpg'],\n",
    "             description=\"Demonstrates a pre-trained model from torchvision for image classification.\",\n",
    "             allow_flagging=\"never\").launch(inbrowser=True, share=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea09cf26",
   "metadata": {},
   "source": [
    "#### Using HuggingFace and Gradio\n",
    "\n",
    "Loading a pre-trained model from torchvision, pre-processing the input, and post processing the output are all messy. Sometimes, we just want to load and use a machine learning model. Hugging Face provides a shortcut for all these steps through the use of `pipeline`. In `pipeline`, we supply the task name and the pre-trained model that is stored in Hugging Face Model Hub.\n",
    "\n",
    "In this example, we use a much better model compared to ResNet18. It is called [BEIT](https://arxiv.org/abs/2106.08254) and can classify objects up to about 22k categories. We construct the `gradio` app by calling `from_pipeline()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "510928fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "from transformers import pipeline\n",
    "\n",
    "pipe = pipeline(\"image-classification\", \n",
    "                 # model that can do 22k-category classification\n",
    "                 model=\"microsoft/beit-base-patch16-224-pt22k-ft22k\",)\n",
    "gr.Interface.from_pipeline(pipe, \n",
    "                           title=\"22k Image Classification\",\n",
    "                           description=\"Object Recognition using Microsoft BEIT\",\n",
    "                           examples = ['assets/wonder_cat.jpg', 'assets/aki_dog.jpg','assets/birdie1.jpg'],\n",
    "                           allow_flagging=\"never\").launch(inbrowser=True, share=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb86c1b6",
   "metadata": {},
   "source": [
    "#### Automatic Speech Recognition (ASR)\n",
    "\n",
    "Let us shift to audio or speech domain. In this example, we demonstrate an Automatic Speech Recognition (ASR) system. We will use our microphone to record audio which is then converted to text using ASR. In this example, best to open the application in another browser tab by setting `inbrowser=True`.\n",
    "\n",
    "Before running the `gradio` app, this ASR requires `sentencepice` module. Let us install it first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d73f6123",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install sentencepiece --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6ad04d0",
   "metadata": {},
   "source": [
    "In this ASR, we use OpenAI Whisper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66d394c8",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "from transformers import pipeline\n",
    "\n",
    "pipe = pipeline(task=\"automatic-speech-recognition\", \n",
    "                model=\"openai/whisper-tiny\")\n",
    "gr.Interface.from_pipeline(pipe,\n",
    "                           title=\"Automatic Speech Recognition (ASR)\",\n",
    "                           description=\"Using pipeline with OpenAI Whisper\",\n",
    "                           examples=['assets/ljspeech.wav',],\n",
    "                           ).launch(inbrowser=True, share=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6cfb3b5",
   "metadata": {},
   "source": [
    "#### Text to Speech (TTS)\n",
    "\n",
    "Let us do the reverse of ASR or Text to Speech (TTS). In this example, we supply text and this text is converted to speech using the voice of Linda Johnson. We use a pre-trained model of [FastSpeech2](https://arxiv.org/abs/2006.04558) that is provided by Facebook in Model Hub.\n",
    "\n",
    "In this example, we use [`load()`](https://gradio.app/docs/#load) method to load the pre-trained model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "276b27ea",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "gr.load(\"huggingface/facebook/fastspeech2-en-ljspeech\", \n",
    "        description=\"TTS using FastSpeech2\",\n",
    "        title=\"Text to Speech (TTS)\",\n",
    "        examples=[[\"but the first bible actually dated which also was printed at maintz by peter schoeffer in the year fourteen sixty two\"]]\n",
    "        ).launch(inbrowser=True, share=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mspeech",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
